22 research outputs found

    GABAC : An arithmetic coding solution for genomic data

    Get PDF
    Motivation: In an effort to provide a response to the ever-expanding generation of genomic data, the International Organization for Standardization (ISO) is designing a new solution for the representation, compression and management of genomic sequencing data: the Moving Picture Experts Group (MPEG)-G standard. This paper discusses the first implementation of an MPEG-G compliant entropy codec: GABAC. GABAC combines proven coding technologies, such as context-adaptive binary arithmetic coding, binarization schemes and transformations, into a straightforward solution for the compression of sequencing data. Results: We demonstrate that GABAC outperforms well-established (entropy) codecs in a significant set of cases and thus can serve as an extension for existing genomic compression solutions, such as CRAM. © 2019 The Author(s). Published by Oxford University Press

    Design considerations for workflow management systems use in production genomics research and the clinic

    Get PDF
    Abstract The changing landscape of genomics research and clinical practice has created a need for computational pipelines capable of efficiently orchestrating complex analysis stages while handling large volumes of data across heterogeneous computational environments. Workflow Management Systems (WfMSs) are the software components employed to fill this gap. This work provides an approach and systematic evaluation of key features of popular bioinformatics WfMSs in use today: Nextflow, CWL, and WDL and some of their executors, along with Swift/T, a workflow manager commonly used in high-scale physics applications. We employed two use cases: a variant-calling genomic pipeline and a scalability-testing framework, where both were run locally, on an HPC cluster, and in the cloud. This allowed for evaluation of those four WfMSs in terms of language expressiveness, modularity, scalability, robustness, reproducibility, interoperability, ease of development, along with adoption and usage in research labs and healthcare settings. This article is trying to answer, which WfMS should be chosen for a given bioinformatics application regardless of analysis type?. The choice of a given WfMS is a function of both its intrinsic language and engine features. Within bioinformatics, where analysts are a mix of dry and wet lab scientists, the choice is also governed by collaborations and adoption within large consortia and technical support provided by the WfMS team/community. As the community and its needs continue to evolve along with computational infrastructure, WfMSs will also evolve, especially those with permissive licenses that allow commercial use. In much the same way as the dataflow paradigm and containerization are now well understood to be very useful in bioinformatics applications, we will continue to see innovations of tools and utilities for other purposes, like big data technologies, interoperability, and provenance

    Simulating Next-Generation Sequencing Datasets from Empirical Mutation and Sequencing Models

    No full text
    <div><p>An obstacle to validating and benchmarking methods for genome analysis is that there are few reference datasets available for which the “ground truth” about the mutational landscape of the sample genome is known and fully validated. Additionally, the free and public availability of real human genome datasets is incompatible with the preservation of donor privacy. In order to better analyze and understand genomic data, we need test datasets that model all variants, reflecting known biology as well as sequencing artifacts. Read simulators can fulfill this requirement, but are often criticized for limited resemblance to true data and overall inflexibility. We present NEAT (NExt-generation sequencing Analysis Toolkit), a set of tools that not only includes an easy-to-use read simulator, but also scripts to facilitate variant comparison and tool evaluation. NEAT has a wide variety of tunable parameters which can be set manually on the default model or parameterized using real datasets. The software is freely available at <a href="http://github.com/zstephens/neat-genreads" target="_blank">github.com/zstephens/neat-genreads</a>.</p></div

    SNP substitution frequency matrices for breast cancer model.

    No full text
    <p>The label for each 4 × 4 matrix specifies the nucleotide immediately preceding and following the SNP position. For example, row 3 column 2 of the “A_A” matrix specifies the frequency of AGA mutating into ACA, as observed in the breast cancer SSM dataset.</p

    Empirical insert size distribution from two example BAM files.

    No full text
    <p>(Left) ICGC donor DO35138: <a href="http://dcc.icgc.org/donors/DO35138" target="_blank">dcc.icgc.org/donors/DO35138</a>, (Right) ICGC donor DO221544: <a href="http://dcc.icgc.org/donors/DO221544" target="_blank">dcc.icgc.org/donors/DO221544</a>, both from project PACA-CA.</p

    SNP substitution frequency matrices for Melanoma model.

    No full text
    <p>Note the strong preference for G → A and C → T transitions, as observed in existing work [<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0167047#pone.0167047.ref017" target="_blank">17</a>].</p
    corecore